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Abstract 

The increasing availability of high throughput data arising from gene expression 
studies leads to the necessity of methods for summarizing the available information. 
As annotation quality improves it is becoming common to rely on the Gene Ontology 
(GO) to build functional profiles that characterize a set of genes using the frequency 
of use of each GO term or group of terms in the array. In this work we describe a 
statistical model for such profiles, provide methods to compare profiles and develop 
inferential procedures to assess this comparison. An R-package implementing the 
methods is available. 
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1 Introduction 



DNA microarrays belong to recently developed technologies which allow to 
measure the expression of thousands of genes simultaneously, in a single ex- 
periment. It is expected that these experiments will contribute to solve many 
relevant biological problem s ranging from the identification of complex genetic 
diseas es, Alon et al. ( 1999f ). or the prediction of tumor type, Alizadeh et al. 



(2000), to target discovery the pharmaceutical industry. 



A common trait in these type of studies is the fact that they generate huge 
quantities of data and one may end with lists of up-to thousands of genes 
which need to be given a biological interpretation. 

A typical microarray experiment is one who looks for genes differentially ex- 
pressed between two or more conditions. That is, genes which behave differ- 
ently in one condition (for instance healthy [or untreated or wild-type] cells) 
th an in another ( for in stance tumour [or treated or mutant] cells). The study 
by iHengel et al. is of this type and will be used as an illustration of 



the ideas discussed in this paper. These authors showed that memory CD4+ 
T-cells behave differently if they present (CD62L+) or they lack (CDQ2L-) 
expression. In a study oriented to find the genetic regulation of these differ- 
ences they found 144 genes to be upregulated in CD4 + /CD62L— T-cells 
relative to CDA + /CD62L+ T-cells. Methods such as those presented here 
have been developed to contribute to the biological interpretation of the re- 
sulting lists of genes. To do this they rely on the Gene Ontology, which is 
described in the following. 



1.1 The Gene Ontology 



Attempts to perform a biological interpretation of these experiments are of- 
ten based on the Gene Ontology (GO), an annotation database created and 
maintained by a public consortium, the Gene Ontology Consortium 1 , whose 
main goal is, citing their mission, to produce a controlled vocabulary that can 
be applied to all organisms even as knowledge of gene and protein roles in cells 
is accumulating and changing. The GO is organized around three principles 
or basic ontologies: (1) Molecular function (MF), which describes tasks per- 
formed by individual gene products such as transcription factor or ATPase 
activity; (2) Biological process (BP), which describes broad biological goals, 
such as mitosis or purine metabolism, that are accomplished by ordered as- 
semblies of molecular functions, and (3) Cellular component (CC) describing 
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Gene ontology 




Fig. 1. Networks of ter^iW^GO 



subcellular structures, locations, and macromolecular complexes such as nu- 
cleus, telomere, and origin recognition complex. A given gene product may 
represent one or more molecular functions, be used in one or more biological 
processes and appear in one or more cellular components (see Figure 1). 

Figure 1 shows how a given gene product can be characterized using different 
related terms in each ontology. An important point to note is that the infor- 
mation here is not "linear" in the sense that, although there is a hierarchical 
relation, there are interrelations between levels and terms in each ontology. 
As a consequence an appropriate representation for this figure is a directed 
acyclical graph (DAG) and several analysis methods rely precisely on this 
representation. This is not the case of the methods discussed here. 
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1.2 Summarizing and interpreting microarray experimental results 



The GO is a rich data structure which contains a great quantity of information 
about the relation between terms. But due precisely to this, it is difficult to 
work with them, all at once. This fact has motivated that, in recent years, 
different approaches to GO-based analysis of the results of high throughput 
experiments have been considered. As a r esult of this effort many methods 
and even more tools have been developed. IMosquera and Sanchez-Pla (2005) 
is a review of the existing tools, jointly with the questions they try to solve. 



Typically, after having selected one list of interesting genes one can obtain 
the induced sub-graph, that is the graph formed by the subset of the Gene 
Ontology whose nodes are related to the genes in the list, either directly or 
through other nodes. These graphs may be big, complex, structures, specially 
when the list originating them is also big. In order to simplify this structure 
it may be sliced or projected on the nodes which are at a certain distance of 
the top node (what is called a level of the GO). This will originate a table 
of frequencies, with each cell containing the number of genes annotated by 
its corresponding term at the level where the slice has been done (see Figure 
2). The lattice structure of the graphs implies that one gene may appear in 
multiple cells of this table, which we call, from now on, a functional profile. 
Once this classification has been done there are different ways to analyze it 
which are briefly presented below. 



One straightforward possibility is to perform some type of enrichment anal- 
ysis which consists of a comparison in order to establish if the percentage 
of genes in a certain GO category has increased or decreased in the genes 
selected relatively to those in the population. If this is so, a biological expla- 
nation of the differences can be attempted based on this enrichment. This 
is usually done by means of a Fisher test or any of its variations and is 
performed category-wise for each of the groups in the level selected followed 
by some type of multiple t esting adjustment. 

Progr ams such as f atiGO dAl-Shahrour et all (ho04l)). DAV I D dDennis et al. 



(2003)) and many more (see Mosquera and Sanchez-Pla ( 20051 )) perform 
some form of this enrichment analysis. This is by far the most used ap- 
proach nowadays. 

The main characteristic of the previous approach is the fact that each cate- 
gory is compared separately. A reasonable complementary extension to this 
may be to consider all categories at once and to compare, for instance, the 
categorization of selected genes with that of all the genes in the array. There 
exist some tools perfor ming this type of compa risons, such as goTools, a 
Bioconductor package ( Paquet and Yang ( 2005| )) taking this point of view 
but allowing only for visual comparisons. 

Recent works are developing tests which can also be applied to analyze a 
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whole set of genes selected a priori (see iMootha et all (|2003h or ISmvthl 
(2005)). Although they are related in spirit with the previous approaches, 
the way they proceed is totally different and will not be discussed here. A 
comparison seems however interesting and will be presented elsewhere. 



This work is devoted to the modelling and analysis of functional profiles adopt- 
ing the intermediate position just described, that is looking at the profile as a 
whole more than as a set of unrelated categories. It will be shown that func- 
tional profiles characterize the set of selected genes and that they can be used 
to perform interesting comparisons such as over-expressed vs. under-expressed 
genes, or between arrays of different brands. 



2 Statistical models 



A functional profile can be seen as a numerical vector with named cells. Each 
cell corresponds to a category in a given ontology, usually, but not necessarily, 
at the same level in the GO. Each category can be characterized by a unique 
number (GO:nnnnnnn) and a descriptive name. Saying that a node is at level k 
means that the shortest path between this and the main node in each ontology 
(MF, BP or CC) has k — 1 nodes. The cell number represents the number of 
genes whose path to the base level has a node in this category. 

Table 1 shows a functional profile for a set of 140 genes clasified at the second 
level of the Molecular Function Ontology. 

An important thing to have in mind when one considers analyzing data start- 
ing from profiles like that in Table 1 is that building this table suppresses the 
structure of the original data, as any categorization does and, as a result of the 
possibility of a gene to belong to more than one category, the sum of counts is 
higher than the number of genes, and in consequence a different model than 
the usual multinomial one is needed to represent cell counts. 



3 The profile distribution 



In order to overpass the problem that a multinomial model is not adequate to 
represent a functional profile, the following strategy is adopted. Given a profile 
with s categories (s = 8 in table 1) let Q = {A\, A s } be the space of events 
corresponding to observing one individual in one of the categories l,...,s. Given 
that it is possible that the same gene belongs to multiple categories we must 
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Fig. 2. A simple functional profile, based only on 3 genes, showing the fact that a 
given gene may appear in different categories 

consider, instead, the space of events 

IT = {A 1 ,A 1 xA 2 ,...,A 1 x A s , A 2 , A 2 xA 3 ,..., A s ^ x A s }, 

where Ai means that a gene has appeared only in category % and Ai x Aj 
means that it has appeared in both % and j. For simplicity we make our 
development assuming that the multiplicity is only for two categories, but it 
is straightforward to extend it to more than two. 

Taking this crossed-structure approach, each gene will appear only once, at 
most, in each category so that a given experimental situation may be charac- 
terized by an expanded profile 



nV = n(pu,p 12 , ...,pi s , ...,P(s-i)s,PssY 



so that the sample expanded profile, nV, is associated with a multinomial 
distribution: 

nV~M(n;V) (2) 

where n is the number of genes forming the profile (that is, classified at a 
given level of a given ontology), pa is the probability of a gene to belong only 
to class Ai, pij = pji is the probability of that gene belonging simultaneously 
to classes Ai and Aj and pa and p^ are the corresponding realizations from a 
sample of size n. 
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In practice we are interested in the distribution of the contracted variable 
nP = (npi ,np2., ...,np s .)* which represents the profile, that is, the counts in 
categories (A\_, A 2 ., ■ ■ ■ , A s ), where 



A i . = A 4 \j[\jA j xA i \\j[\jA i xA j 

v<* / \J>* 

and pi, represents the probability of Ai_. The distribution of nP is established 
in the following theorem: 

Theorem 1 The random variable nP = (npi., np2., ■ ■ ■ , np 8 .Y is asymptoti- 
cally distributed as a multivariate normal distribution 

N(n/i,nS), 

where: p, = (pi.,P2.> • • • ,Ps.Y, ^nd S = (cy), with: 

Pi.{l-Pi), i = l,...,s, (Tij = pij -Pi.pj.,i ^ j,i,j = l,...,s. 



Proof 1 The asymptotic normality ofnP follows from considering the asymp- 
totic approximation of the multinomial law to the normal distribution and the 
distributional inv ariance when a linear transformation is applied to normal 
distributions (see Serflind 1 198&) ). where the transformation is described by: 



CV 



(3) 



and C = \ C^\C^\ . . . |C^J is a matrix with s rows and s- (s + l)/2 columns 
defined, by boxes, as: 
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with 
where: 



cw = (4) , 



1 if i = h or (j = i — h + 1 and i > h + l) 
elsewhere. 



(5) 
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In the following we will use the term "profile" indistinctly to refer to the 
absolute frequencies or to the pair formed by the relative frequencies and the 
number of genes, n. 



4 Comparison of profiles 

In many practical applications the user is interested in comparing profiles. 
This is meaningful in a variety of situations, let it be to compare the profile 
obtained from a set of over or under-expressed genes with all the genes on the 
array, to compare the profiles obtained in different experimental conditions or 
to compare the profiles corresponding to arrays of different types or brands. 

Our approach is based on defining an appropriate measure of distance between 
profiles d(Pj,Pj). This allows to state the problem of comparing two profiles 
in terms of testing the hypothesis H : d(Pi, Pj) — vs i?i : rf(Pj, Pj) > 0. 

The choice of the appropriate distance is often a point for extensive discussion. 
Sometimes the underlying statistical model is relevant for its choice. In other 
cases the availability, or even computational feasibility is decisive. Here we 
will use the squared Euclidean distance which offers a good balance between 
ease of interpretation and properties that can be derived for it. 

One may consider different scenarios for working with this problem: 

• One-sample problems consist of comparing an estimated profile with a fixed 
one, which makes the role of "population". This may be, for instance, a 
profile obtained from the whole array or even the genome of the species if 
it is available. 

• Two-sample problems consist of comparing two (or more) estimated pro- 
files, which are obtained from populations which may be or may not be 
independent. This may be for instance the case of profiles formed with the 
genes selected in two different experiments about the same disease, or those 
obtained with the genes selected from two (or more) different mutations of 
a given wild type. 

Only the first case will be discussed in the following. Two sample problems 
will be presented elsewhere. 

4-1 Main results 

Let Po represent a fixed population profile, and P, an estimated profile based 
on a sample of size n. The squared Euclidean distance between P and P is 
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defined as: 

d(P, P ) = (P - P )* (P - Po) = E(Pi. - Pio?- (6) 

i=i 



Based on Theorem 1, the distribution of the distances can be established 
setting the basis to perform statistical inference on the estimated profiles. 

Theorem 2 Let P represent a fixed population profile, P an estimated pro- 
file based on a sample of size n and d(P, P ) the squared Euclidean distance 
between Po and P, 

(1) If P ^ P then 



n 1 ' 2 



d(P,Po)-d(P,P„) ^Z~N(0,a 2 ), (7) 



where: 



o- 2 = 4 J2 (P<- ~~ P*o) (**iP». + (1 - &ij)Pij ~ Pi.Pj) (Pj. ~ Pjo) 

= 4(P-P ) t E(P-P ), (8) 

r 

and ' — > " stands for "convergence in distribution" . 
(2) IfP = P then 

nd{P,P )-^J2^xli, (9) 

i=i 

where fa are the eigenvalues of matrix E defined in Theorem 1 and xl ; 
are independent chi-squared random variables with one degree of freedom. 

Proof 2 Consider the algebraic relation: 



d(P,P c 



P-Pn P-P 



P-Pl 



= ((p - p) — (Pq — r 

: d(P, P) + d(P, P ) + 2(P - P )*(P - P). 



(10) 



Note also that the asymptotic distribution of nd(P, P) follows from standard 
results about quadratic forms Wik and Guns\ \198A) ): 



n d(P,P) 



n (P-PVL(P-P) 



8=1 



where I s is the s— dimensional identity matrix and fa are the eigenvalues of 



I S E = E. 
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Let P ^ Pq. If we take the algebraic relation (10), the following holds: 



n 1 ' 2 



d(P, P ) - d (P, P )l = n 1 ' 2 d(P, P) + 2U 1 ' 2 (P - P )* (P - P). (12) 



Then, as n 1 ^ 2 d(P, P) — > and following Theorem 1, the first part of 
Theorem 2 is established. 

Let P = Pq. In that case we simply have that the first two terms of (10) 
become null, so that 

n d(P, P ) = n d(P, P) = n (P - P)%(P - P), (13) 

and the second result follows from (11). 

The proof of Theorem 2 is based on some properties of the squared Euclidean 
distance. It can be easily extended to other smooth distance indexes with the 
only condition that they admit a Taylor series expansion. In that case the 
approximation appearing in (11) would be based on the eigenvalues of if^E 
where Ht is the Taylor Hessian of the distance expansion. It must be noted, 
however, that with squared Euclidean distance, error terms depend exclusively 
on the convergence of the multinomial to the normal distribution given that 
terms of order greater than two vanish in the Taylor expansion. In absence 
of other constrains, such as biological interpretation, this is an additional 
argument favoring the use of this distance in front of other options. 

In practical situations it is ofte n difficult to deal wit h linear combinations 



of chi-squared random variables. iRao and Scottl ([19811 ) suggested to use the 



following approximation for the combination introduced in equation 9: 



i=\ 

where 

and Xranfcs stands for a chi-square random variable with rankT, degrees of 
freedom. 

Our simulation results show that the above approximation performs very 
poorly in our case. In this work we have used a similar approximation which 
we consider to have better adaptability properties: 



i=l 



PiXli ^ OXranfcE + b ' ( 15 ) 
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where 



E(xl)=aE{ 



^CrankT, 



b and Var (xfj = a 2 Var (xL^s) (16) 



giving: 



£ A 2 and & = 



i=l 



EA 2 



(17) 



Applications 



These results make it possible to construct hypothesis tests and confidence 
intervals to perform statistical inference on the profiles. 

An approximate confidence interval for the "true" distance g?(P,Po), with 
approximate confidence level 1 — 2a is 



d(P,P ) -2a 4= > d(P,P )+z a -^= 
n \/n 



where a = 2y (P — Po) S ^P — Po) is the sample standard error estimator of 

nd(P,Po), directly obtained from (8), and z a stands for the a right quantile 
of a standard normal random variable Z, Pr {Z ^ z a } = a. 

Consider now the contrast 



H : d(P,P ) = 
H x : d(P,P )>0 



(19) 



(Say, is the set of differentially expressed genes enriched / impoverished in some 
GO categories with respect to all the genes in the array?) From (7) the rule 



"Reject ffoifWE AxL > nd )< «" , 



(20) 



\i=l 



defines a test of nominal size a where d stands for the sample squared euclidean 
distance value. 



There exist approximate methods to compute tail proba bilities of linear combi- 

nation s of independent chi-square random variabl es. Se e Sheil and O'Muircheartaigh 
(1977) for the case of non-negative coefficients or iFarebrother (1984) : 'or more 
general algorithms. 



These algorithms are computationally complex so that we have taken a more 
direct approach based on simulation. It consists of estimating the probability 
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in (20) by means of the relative frequency 

m 

where Yj, j = 1, . . . , m, are independent realizations of J27=i PiXu- 

In the examples and simulations described in the next sections, we have taken 
m = 10,000. 

Similarly, the rule 

"Reject H if P [x 2 rank ^ > — — ) < a", (21) 

defines an alternative test procedure. Here a and b are defined in (17). This 
last method is slightly easier to compute as no simulations nor complex ap- 
proximations are required to obtain the critical value. Note that E used in 
the test procedures is the known covariance matrix associated to the known 
profile specified by the null hypothesis. 



5 Example. Biological interpretation of a list of genes 



CDA+ T-cells are a type of white-blood cells which are very important in 
the organism immune surveillance. As an example of the many processes in 
which they are involved it is known that the decrease in number of CDA+ 
T-cells is the primary mechanism by which HIV causes AIDS. The activation 
of T-cells is related to the presence of a substance, L-selectin (CD62L). This 
molecule may be absent or present in a cell yielding two possible types of 
cells: CDA + /CD62L— T-cells lack L-selectin expression, whereas CDA + 
/CD62L+ T-cells present L-selectin expression. 



iHengel et al. (2003) performed a study aiming at finding the genetic reg- 



ulation of these differences. They found 144 genes to be up-regulated in 
CDA + /CD62L- T-cells relative to CDA + /CDQ2L+ T-cells. The list is 
available in the supplementary material website. All the computations in this 
work have used only 140 of these genes, because there are 4 which were not 
annotated in the GO. 

Table 1 shows the functional profiles for this list of genes at the first level of 
the three ontologies. 

There are two reasons why these profiles are not necessarily very informative. 
First, for simplicity, we have built the profile at the highest possible level 
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formed by very generic groups. Even if these functionalities are too general 
one might rely on them for interpretation, but first we need to know if this 
profile characterizes the selected genes or it is simply the profile corresponding 
to the population, or in this case, to a big sub-population formed by all the 
genes in the microarray. In order to answer this second question a comparison 
of profiles is meaningful. Figure 3 and table 2 show the comparison between 
the sample (140 genes in the list) and the population (all the genes on the 
chip) profiles. It can be seen that there does not appear to be any significant 
difference in BP and MF profiles, but there is some for Cellular Component. 

Table 2 shows the distances and p-values computed using the two methods de- 
scribed above. The test performed here has null hypothesis H : d(P, Pq) = 0, 
that is, not rejecting the null hypothesis means that the set of genes consid- 
ered to be differentially expressed is distributed between GO categories in the 
same form as all the genes in the array. 



Description 


GOID 


Frequency 


catalytic activity 


GO:0003824 


38 


signal transducer activity 


GO:0004871 


21 


structural molecule activity 


GO:0005198 


2 


transporter activity 


GO:0005215 


16 


binding 


GO:0005488 


76 


antioxidant activity 


GO:0016209 


1 


enzyme regulator activity 


GO:0030234 


4 


transcription regulator activity 


GO:0030528 


10 


Total number of hits 




168 



Table 1 

Functional profiles at the first level of the Molecular Function Ontology for the list 
of genes in the example data 



Ontology 


Distance ± precision 


LC-chi p-value 


Approx-chi p-value 


MF 


0.00734 ± 0.013114 


0.5744 


0.6062 


BP 


0.00511 ± 0.009710 


0.4112 


0.4611 


CC 


0.05749 ± 0.053666 


0.0000 


3.11e-07 



Table 2 

Distances and p-values computed on first-level profiles for the three ontologies for 
the example data. "LC-chi" stands for the test based on linear combinations of 
chi-square random variables (20) and "Approx-chi" stands for the test based on 
approximating a chi-square distribution (21) 
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Functional profile for TCells 
BP ontology 



regulation of biological process 

«5 



physiological proceS 
developing 
biological_process unknot 



□ 


All 


□ 


Selected 



Functional profile for TCells 
CC ontology 



;in complex 
organetS 
llular matS 

nt unknot 



□ 


All 


□ 


Selected 



main parameters: a sample size n and two expanded profiles V and Vq which 
induced the profiles P = CV and P = CVq. 

For each one of the simulation scenarios characterized by a given combination 
of the above parameters, 10,000 (expanded) sample profiles were generated 
according to a multinomial distribution Ni(V\ n) and contracted according to 
(3). For each generated profile, the test procedures (20) and (21) were applied 
in order to determine whether the null hypothesis in (19) was rejected or 
not, and the confidence interval (18) computed to determine its length and 
to inspect its coverage of the true distance d (P, P ). These simulation results 
were collected to estimate the true rejection probabilities, the mean interval 
length and the true coverage probability. 

In a first series of simulation experiments, the profiles specified in the preced- 
ing example were taken as directly defining the population and/or the null 
hypothesis, with n = 140. 

Table 3 displays the (simulation estimated) probability of rejecting Hq, the 
mean length of the confidence interval and its coverage. All results correspond 
to a nominal significance level of 0.05 or to a nominal coverage of 0.95. 

Hq is true when the profile generating the gene samples and the profile spec- 
ifying H are the same. In these cases, both tests seem to perform according 
to the nominal significance level, with an apparent tendency of test (21) to 
slightly higher type I error probabilities. As can be expected, the confidence 
interval (18) is not adequate when Hq is true, with a greater coverage than the 
nominal one and a low precision (the intervals are too long in mean). When Po 
corresponds to CD62L and P to hgu95A, H is not true, though both profiles 
are very similar in the BP and MF ontologies, with "true" squared Euclidean 
distances of 0.0020 and 0.0064, respectively. For n = 140, the power of the 
tests is low, around 0.14 for BP and 0.36 for MF. That is, for these quite 
similar "true" profiles, the probability of type II error is high and the confi- 
dence interval still performs not adequately. On the other hand, the simulated 
profiles differ appreciably more for the CC ontology. Then the power of the 
tests is much greater, around 0.91, but the coverage is still inadequate, now 
too low. 

In order to have a more comprehensive view, we performed an additional 
simulation study. The following geometric model was considered: 

Let V = (pii,p 2 2, • • ■ ,Pss,Pi2,Pi3, ■ ■ ■ ,P(s-i)s,Pi23, ■ ■ •) represent an expanded 
profile. Maintaining the order of the elements, recode the indexes as 

(P[1],P12], ■ ■ ■ ,P[s],P[s+l],P[s+2], ■ ■ ■ n y- • 
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Profile specifying Hq 




Onto 


CD62L 


hgu95A 








BP 


n f)4S3+f) 0042 








LC-chi 


MF 


0.0439±0.0040 










CC 


0.0475±0.0042 










BP 


0.0529±0.0044 








Approx-chi 


MF 


0.0524±0.0044 










CC 


0.0475±0.0042 






CD62L 
















BP 


0178±7 58E-5 








Interval length 


MF 


0.0220±8.59E-5 










CC 


0.0092±7.01E-5 










BP 


0.9951±0.0014 








Coverage 


MF 


0.9883±0.0021 










CC 


0.9986±0.0007 




Simulated profiles 


















BP 


i S97+D 0068 

\J . 1 *jU i —1—\J. \J\J\J(J 


0527+0 0044 






LC-chi 


MF 


0.3555±0.0094 


0.0550±0.0045 








CC 


0.9057±0.0057 


0.0453±0.0041 








BP 


0.1455±0.0069 


0.0557±0.0045 






Approx-chi 


MF 


0.3692±0.0095 


0.0574±0.0046 








CC 


0.9198±0.0053 


0.0518±0.0043 




7 n r A 

hgu95A 
















BP 


0.0215 ±8.63E-5 


0.0183±7.89E-5 






Interval length 


MF 


0.0292±8.97E-5 


0.0215± 8.65E-5 








CC 


0.0313± 0.0001 


0.0042±3.37E-5 








BP 


0.9907±0.0019 


0.9943±0.0015 






Coverage 


MF 


0.9794±0.0028 


0.9868±0.0022 








CC 


0.8896±0.0061 


0.9842±0.0024 



Table 3 



Probability of rejecting Hq, mean length of the confidence interval for the true 
squared Euclidean distance and coverage. Simulated profiles were generated accord- 
ing to CD62L or hgu95A. All simulation results are displayed with ±95% confidence 
limits. "LC-chi" stands for the test based on linear combinations of chi-square ran- 
dom variables (20) and "Approx-chi" stands for the test based on approximating a 
chi-square distribution (21) 
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and assume that 

p m oc (i - ey- 1 , for o < e < 1. 



(22) 



The sole purpose of model (22) is to define families of profiles fully character- 
ized by a unique parameter 9, for a given number of ontology classes s and a 
given maximum level of possible simultaneity k -that is, with the last term 
having index (s — k + 1) (s — k + 2) . . . (s — 1) s. Progressively different values 
of 9 produce progressively greater distances between profiles, that is, scenarios 
progressively distant from the assumption of validity of the null hypothesis. In 
all the simulations, H was defined by a profile associated to 9 = 0.4, while 
9 = 0.4, 0.35, 0.30, 0.25 and 0.20 defined possible scenarios were the sample 
profiles were generated according to distributions more and more distant from 
H . Additionally, the following sample sizes (number of genes) were consid- 
ered: n = 50, 100, 140, 200, 500 and 1000. Here we report the results for s = 6 
and k = 5 (as in the CDA + /CDQ2L T-cells example for the BP ontology) for 
a nominal significance level of 0.05 and nominal coverage of 0.95. The results 
of other situations (including s — 11, k — 6 and s = 4, k = 2, respectively 
the case of the MF and CC ontologies) are accessible in the supplementary 
documentation web site. 

Figures 4 and 6 display the power curves of both tests under consideration. 
They perform in a very similar way and always seem to be in conformity 
with the nominal significance level. Thus, the test based on the chi-square 
approximation seems to be preferable due to its simplicity. 

Figure 6 corresponds the coverage of the asymptotic confidence interval (18). 
As is expected, this confidence interval is not adequate under true H . When 
H is false, only for large sample sizes (500 or more genes) its true coverage 
approximates the nominal one. Otherwise the true coverage is larger than the 
nominal, but at the cost of a very low precision (that is, too long intervals) as 
is shown in Figure 7. For example, if n = 50 genes, with a true 0.005 distance, a 
(nominally) 95% confidence interval will have a length of approximately 0.04, 
too wide with respect to the magnitude of the distance. 
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6 GO nodes, max. simultaneity 5 




Fig. 4. Power of the test based on linear combinations of chi-square random vari- 
ables. The bottom horizontal line corresponds to the reference 0.05 significance 
level. 
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6 GO nodes, max. simultaneity 5 




Fig. 5. Power of the approximated chi-square test. The bottom horizontal line 
responds to the reference 0.05 significance level. 



19 



6 GO nodes, max. simultaneity 5 
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Fig. 6. Coverage of the asymptotic confidence interval. The base reference line cor- 
responds to a 0.95 coverage. 
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6 GO nodes, max. simultaneity 5 
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Fig. 7. Mean confidence interval length 
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7 Discussion and Conclusion 



The analysis and interpretation of biological data based on the Gene Ontology 
is an active field of research. 



Functional profiles constitute an intuitive way to summarize sets of genes of 
any size to facilitate biological interpretation. 



Our theoretical results set the basis for doing statistical inference based on 
these profiles. This allows to turn the analysis of p rofiles from a simple graph- 
ical comparison, such as is done in many papers dBeltran et al.l ( 20031 )) or in 
the goTools Bioconductor package ( Paquet and YaneP 20051 )) to well based 
inferential procedures with a known degree of confidence. 



Essentially we have considered global comparisons between profiles at fixed 
levels of the GO, but extensions are straightforward. For instance, profiling 
may be performed on any set of reference categories, not necessarily a fixed 
level. Also, the theory can be easily adapted to other situations such as the 
analysis of multiple response items in surveys. 



The approach is, of course, not free from limitations. Profiling, as any other 
summary, implies a certain loose of information. Comparing the approach with 
the use of the whole graph it is clear that the later has more information but 
is more difficult to summarize. If we go in the other direction, a category by 
category analysis ( "a la fatiGO" ) helps to see what happens in specific inter- 
esting categories but does not offer a global approach. In brief our approach 
tries to stay between one and the other in a useful way. 



7.1 Software and tools 



The main applied interest of this work is to provide the genomic community 
a research tool that help to assess their conclusions, allowing to go one step 
further than visual approximations such a s that offered by some p rograms, 
such as the Bioconductor package go Tools ( Paquet and Yand ( 20051 )). 



To facilitate the application of our results we have developed a tool which is 
available as an R package which will be freely accessible to the user's commu- 
nity. This will also be submitted to Bioconductor to help its diffusion. Also, a 
web site to make the software available through the web is in development. 
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